DESCRIPTION
Text segmentation and label assignment with user interaction by means of topic-specific
language models and topic-specific label statistics



Field of the invention

The present invention relates to the field of generating structured documents from 
unstructured text by segmenting unstructured text into text sections and assigning a 
label to each section as section heading. The text segmentation as well as the
assignment of labels to text sections, also denoted as labelling, is presented to a user
who has control of the segmentation and the labelling procedure.



Background and prior art 

Text documents that are generated by a speech to text transcription process usually do 
not provide any structure since conventional speech to text transcription systems or 
speech recognition systems only literally transcribe the recorded speech into 
corresponding text. Explicitly dictated commands of text formatting, text highlighting,
punctuation or text headings have to be properly recognized and processed by the 
speech recognition system or by a text formatting procedure being successively applied 
to the text generated by the speech recognition process. 

Both automatic speech recognition and automatic text formatting systems, which are
typically based on training data and/or manually designed text formatting rules,
inevitably produce errors because they lack the human expertise needed to
properly identify complex formatting commands, section boundaries and distinct
text portions, e.g. those representing a section heading. The result of an ordinary speech to
text transcription process or text formatting process therefore has to be provided to a
human proofreader. The proofreader has to browse through the entire document,
gathering information about its content, and decide whether
the speech to text transcription process produced reasonable results and whether the text
formatting has been performed correctly with respect to the content of the document.



The task of the proofreader becomes even harder when the structure of a document is not
explicitly dictated, i.e. when many headings and section boundaries are not explicitly encoded
in the spoken dictation. Furthermore, when even sentence structures, i.e. punctuation
symbols, are rarely dictated, these punctuation symbols have to be manually inserted by
the proofreader.



Partitioning a text into sections is an especially demanding task for a proofreader
because a change of section type cannot be recognized before a
longer part of the new section has been read. The proofreader then
has to jump back to some position in the already examined text in order to insert a
section boundary and an appropriate heading. This constant jumping
between different positions in the document is very time consuming and exhausting for
the human proofreader.



The present invention aims to provide a method, a computer program product, a text
segmentation system as well as a user interface for a text segmentation system in order
to perform a segmentation and labelling of an unstructured text in response to a user's
decision.



Summary of the invention 

The present invention provides an efficient user interface for a text processing system 
which employs a method of segmentation of a text into text sections, of assigning a 
topic to each section, and of assigning a label in form of a section heading to each text 
section. These tasks are performed using statistical models which are trained on the 
basis of annotated training data. First, the method performs a segmentation of the text 
into text sections by making use of the statistical models extracted from the training 
data. After the text has been segmented into text sections, each text section is assigned
a topic indicative of the content of the text section. The assignment of the
topic to a text section also makes use of the statistical models extracted from the
training data. After the text segmentation and the topic assignment have been performed,
a structured text is generated by inserting a label as a section heading into the text. The
label is inserted in the text at a position corresponding to a section border in such a way
that the label is directly followed by the section it refers to. This inserted label is to be
understood as a heading which precedes the following text section.

When the structured text has been generated in the above described way, the structured 
text is provided to a user having control of the segmentation, the topic assignment and 
the general structuring of the text. The method finally performs modifications of the
structured text in response to the user's review.

According to a preferred embodiment of the invention, the insertion of labels as section 
headings comprises a text formatting procedure incorporating formatting steps such as 
punctuating, highlighting, indenting and modifying the type face. 

According to a further preferred embodiment of the invention, the topic assignment to a 
text section also comprises the assignment of a set of labels to the text section. One 
label of the set of assigned labels is finally inserted as a section heading into the text. 
Here, a topic represents a rather abstract declaration of a distinct class or type of section.
Such a declaration is particularly applicable to so-called organized documents following
a typical or predefined structure. For example, a medical report features a topic sequence
like demographic header, patient history, physical examination and medication.

Each section of such a structured document can be identified by an abstract topic. In
contrast to the abstract topic, a label is indicative of a concrete heading of such a
section. For example the section referring to an examination of the patient can be
labelled in a plurality of different ways, such as "physical examination", "examination",
"exam", "surgical examination". No matter how a section of text is labelled, the content
of the section, i.e. in this case an examination, is identified by the assigned topic.

The segmentation of the text into text sections can for example be performed by a
method disclosed in US Pat. No. 6,052,657 making use of language models and
language model scores in order to indicate a correlation between a block of text and a
language model. A more accurate and reliable procedure for text segmentation and topic
assignment is disclosed in the patent application "Text segmentation and topic
annotation for document structuring" filed by the same applicant herewith concurrently.
This document describes a statistical model for text segmentation and topic annotation
by making explicit use of a topic sequence probability, a topic position probability, a
section length probability as well as a text emission probability. These probabilities are
especially helpful when the underlying annotated training data refer to organized
documents.



According to a further preferred embodiment of the invention, the assignment of one
label of the set of labels to a text section, and the insertion of the assigned label as a
section heading of the text section into the text, account for count statistics based on
the training data and/or explicit or partial verbalizations found at the beginning of a
section. The count statistics reflect the observed frequency with which a section assigned to
some topic is preceded by a specific label. In this way, the most frequently assigned
label per topic may be selected as a default heading if no other hints about the most
suitable label or heading are found in the text. In other words, by means of the count
statistics a default label is assigned to a text section.



Alternatively, the label assignment based on the count statistics is overruled when an
explicit verbalization is found at the beginning of a section exactly matching one of the
set of labels assigned to the section. Furthermore, if no label matches exactly with
an explicit verbalization found at the beginning of a section, then a label matching only
partially some verbalization found at the beginning of the section may be inserted
instead of the default label. The assignment of one label to a text section, i.e. the
selection of one label of the set of labels assigned to the text section, can also be
performed with respect to the count statistics based on the training data in combination
with explicit full or partial verbalizations found at the beginning of a section.
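
The selection order described above (exact verbalization first, then partial verbalization, then the most frequent label) may be illustrated by the following minimal sketch. It is an illustration only and not the disclosed implementation; the function name `select_label`, the three-word window used for partial matching and the example counts are assumptions.

```python
def select_label(label_counts, section_words, window_size=3):
    """Choose a heading for a section.

    label_counts  -- dict mapping each candidate label of the topic to its
                     training-data count (hypothetical example data)
    section_words -- words at the beginning of the section
    window_size   -- assumed number of leading words inspected for a
                     partial match
    Returns (label, reason) with reason in {"exact", "partial", "default"}.
    """
    start = [w.lower() for w in section_words]
    # Candidate labels, most frequent first (ties broken alphabetically).
    labels = sorted(label_counts, key=lambda l: (-label_counts[l], l))

    # 1. Exact match: every word of the label opens the section.
    exact = [l for l in labels if start[:len(l.split())] == l.lower().split()]
    if exact:
        # Prefer the longest matching label.
        return max(exact, key=lambda l: len(l.split())), "exact"

    # 2. Partial match: some label word occurs among the first words.
    window = set(start[:window_size])
    for l in labels:
        if window & set(l.lower().split()):
            return l, "partial"

    # 3. Default: the most frequently assigned label of the topic.
    return labels[0], "default"


counts = {"physical examination": 40, "examination": 25, "exam": 10}
print(select_label(counts, "examination shows normal findings".split()))
print(select_label(counts, "the physical findings are normal".split()))
print(select_label(counts, "patient is alert".split()))
```

In this sketch the count statistics only decide between labels that are otherwise equally well supported by the text, which corresponds to the combined use of counts and verbalizations mentioned above.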

According to a further preferred embodiment of the invention, if some full or partial
verbalization is found at the beginning of the section, this verbalization may be removed
from the section. This is useful if the verbalization represents an explicitly dictated
heading which is replaced by the inserted label. As an example, a section starting with
"medications the patient takes . . ." can be assigned the label "medications". Since this
label serves as a heading for the subsequent section, the term "medications" itself
should be removed from the text of the section, leaving the proper content of the section
starting with "the patient takes . . .". Modifications of this strategy include the removal
of some predefined filler words which may be part of the dictated heading or initial
phrase of some section, even if these filler words are not part of the label, e.g. if some
section starts with "medications are X, Y, Z, . . ." which is converted into the heading
"medications" followed by the list of medications "X, Y, Z, . . ." where the filler word
"are" is skipped.

According to a further preferred embodiment of the invention, the insertion of a section
heading into the text, e.g. due to an exact match between an explicit verbalization and
a label, can be overruled by the user. In this case, the insertion is reversed by the method
and the original text portion is restored. More specifically, if some section-initial words
have been removed due to a match with the assigned label, these words have to be
re-inserted when the user decides on a different label which does not match these removed
words.
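
The removal of a dictated heading together with filler words, and the restoration of the removed words when the user chooses a non-matching label, may be sketched as follows. This is a minimal illustration under stated assumptions: the `Section` record, the function `apply_label` and the contents of `FILLER_WORDS` are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass, field

# Hypothetical filler-word list; the description does not enumerate one.
FILLER_WORDS = {"are", "is", "include", "includes"}


@dataclass
class Section:
    words: list                                   # section text as words
    heading: str = ""
    removed: list = field(default_factory=list)   # words cut from the start


def apply_label(section, label, fillers=FILLER_WORDS):
    """Set `label` as heading and drop a matching dictated heading.

    Words removed by an earlier call are re-inserted first, so a label
    that does not match them leaves the original text restored.
    """
    words = section.removed + section.words       # undo earlier removal
    label_words = label.lower().split()
    n = len(label_words)
    end = 0
    if n > 0 and [w.lower() for w in words[:n]] == label_words:
        end = n
        # Also skip filler words directly following the dictated heading.
        while end < len(words) and words[end].lower() in fillers:
            end += 1
    section.removed, section.words = words[:end], words[end:]
    section.heading = label
    return section


s = Section("medications are X, Y, Z".split())
apply_label(s, "medications")
print(s.heading, s.words)      # heading set, dictated heading and "are" removed
apply_label(s, "allergies")    # the user overrules the automatic heading
print(s.heading, s.words)      # the original section text is restored
```

Keeping the removed words with the section, rather than discarding them, is the design choice that makes the reversal possible without re-reading the unstructured text.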

According to a further preferred embodiment of the invention, providing the
structured text to a user further comprises providing the complete set of labels
assigned to each text section. Since each label of the set of labels represents an
alternative for the section heading, the user can easily compare the automatically
inserted section heading with alternative headings.

According to a further preferred embodiment of the invention, providing the
structured text to a user further comprises providing indications of alternative section
borders. In this way not only the section borders automatically inserted into the text by
the present method are visible to the user, but alternative section borders are also
provided to the user to facilitate proofreading. The proofreader's
task of finding the correct section borders of the document is thus reduced to
inspecting the automatically inserted section borders and the alternative section borders.

According to a further preferred embodiment of the invention, modifications of the
structured text in response to the user's review refer to the modification of the 
segmentation of the text into text sections and/or modifications of the assignment 
between labels and text sections. Furthermore modifications of performed formatting 
such as punctuation, highlighting and the like are also conceivable. 

According to a further preferred embodiment of the invention, modifications of the text 
segmentation and modifications of the assignment of labels to text sections performed 
in response to the user's review are initiated by the user selecting one of the provided 
labels or one of the alternative section borders. The modification selected by the user is 
then performed by the present method, replacing a section heading by a selected section
heading, or shifting a section border. 

Accomplishing a first text modification may imply that a second text modification has 
to be performed. For example when the section headings are enumerated, the removal of 
a text section requires a re-enumeration of the successive text sections or section labels.
Therefore, the present invention is further adapted to dynamically perform 
modifications that are due to a prior modification performed in response to the user's 
review. 
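
As an illustration of such a dependent modification, the following minimal sketch re-enumerates the remaining headings after one section has been removed. The heading format "N. title" and the function name `renumber` are assumptions chosen for the example.

```python
import re


def renumber(headings):
    """Re-enumerate headings of the assumed form 'N. title' from 1 upwards."""
    result = []
    for index, heading in enumerate(headings, start=1):
        # Strip an existing leading number such as "3. " before renumbering.
        title = re.sub(r"^\d+\.\s*", "", heading)
        result.append(f"{index}. {title}")
    return result


headings = ["1. Patient history", "2. Physical examination", "3. Medication"]
del headings[1]                 # first modification: the user removes a section
print(renumber(headings))       # second modification follows automatically
```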

According to a further preferred embodiment of the invention, a modification of the
assignment of a label to a text section as a section heading is performed in response to
the user either selecting one label of the provided set of labels assigned to the text
section or entering a user-defined label and assigning this user-defined label as
section heading to the text section. In this way the user can quickly and effectively
identify one label of the provided set of labels as the correct section heading or
alternatively define a previously unknown heading for the relevant text section.

The selection of one label of a set of labels as well as the entering of a label is not
restricted to positions in the text that were identified as section borders; an
appropriate set of labels can be provided at any position in the text upon user request. In
this way the user retains complete control of structuring and labelling the document.

According to a further preferred embodiment of the invention, the processing of
modifications in response to the user's review successively triggers a re-segmentation of
the text into text sections and a regeneration of the structured text by inserting labels as
section headings referring to text sections. Both the re-segmentation and the
regeneration of the structured text make use of the statistical models extracted from the
training data and take into account the modifications that have already been
processed in response to the user's review. When for example a user has introduced a
modification in the text, either in the form of a redefinition of a section border or in the
form of re-labelling a section heading, the method of the present invention performs a
successive re-segmentation and regeneration of the structured text while leaving the
modifications previously made by the user unaltered. In this way modifications
introduced by the user are never overruled or re-modified by the inventive method.
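
The principle that user modifications take precedence over a renewed automatic segmentation may be sketched as follows. This is a strongly simplified illustration: an actual re-segmentation would re-run the statistical models subject to the user's constraints, whereas the sketch merely merges an automatic proposal with the stored user decisions. The data layout (word position mapped to heading) and the name `merge_with_user_decisions` are assumptions.

```python
def merge_with_user_decisions(auto_boundaries, user_decisions):
    """Combine an automatic segmentation with stored user modifications.

    auto_boundaries -- dict: word position of a section start -> proposed label
    user_decisions  -- dict: word position -> label fixed by the user, or
                       None if the user removed the boundary at that position
    Returns a sorted list of (position, label); user decisions always win.
    """
    result = dict(auto_boundaries)
    for position, label in user_decisions.items():
        if label is None:
            result.pop(position, None)   # never re-insert a removed boundary
        else:
            result[position] = label     # keep the user's boundary and heading
    return sorted(result.items())


auto = {0: "history", 12: "examination", 30: "medication"}
user = {12: None, 20: "physical examination", 30: "medications"}
print(merge_with_user_decisions(auto, user))
```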

According to a further preferred embodiment of the invention, the re-segmentation of
the text into sections as well as the regeneration of the structured text by inserting labels
as section headings is performed dynamically during a review process performed by the
proofreader or user. The re-segmentation of the text as well as the regeneration of the
structured text can either be applied to all text sections, to the current and all following
sections, or to a single section if specified by the user. For example when a new section
boundary is introduced or when a heading is removed by the user, it is reasonable that a
further restructuring or heading update is restricted to the current section only. In this
way the method can respond faster to small, hence local, changes that have to be
introduced into the text.

According to a further preferred embodiment of the invention, the granularity of the text 
segmentation can be controlled by the user by customizing a so-called granularity 
parameter. In this way the user can determine whether the text is structured in a finer or
coarser way. A change of the customizable granularity parameter results in the removal or
insertion of text sections.

According to a further preferred embodiment of the invention, modifications that are
performed in response to the user's review are logged and analyzed by the present
method in order to further train the statistical models. In this way the entire method can
effectively be adapted to the user's preferences. When for example a distinct label has
been repeatedly removed from the text by the user, the method of text segmentation
refrains from inserting this distinct section heading in future applications. The impact of the
user's modifications on the adaptation of the method, hence the sensitivity of the
adaptation, may also be controlled by the user. This means that for example an
insertion or a removal of a label has to occur a predefined number of times before the method
adapts to this particular user-introduced modification. The number of times a
change has to be made manually before the method adapts to it may
be given by the user.
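
The user-controlled sensitivity of the adaptation may be illustrated by a simple counter with a threshold. The sketch below only covers the removal of labels; the class name `AdaptationLog`, its methods and the default threshold of three are assumptions, and a retraining of the statistical models is not shown.

```python
from collections import Counter


class AdaptationLog:
    """Logs label removals and suppresses labels removed often enough."""

    def __init__(self, threshold=3):
        self.threshold = threshold        # user-given number of repetitions
        self.removals = Counter()         # (topic, label) -> removal count

    def log_removal(self, topic, label):
        self.removals[(topic, label)] += 1

    def is_suppressed(self, topic, label):
        return self.removals[(topic, label)] >= self.threshold

    def filter_labels(self, topic, labels):
        """Return the candidate labels that may still be inserted."""
        return [l for l in labels if not self.is_suppressed(topic, l)]


log = AdaptationLog(threshold=2)
candidates = ["physical examination", "surgical examination"]
log.log_removal("examination", "surgical examination")
print(log.filter_labels("examination", candidates))   # still both labels
log.log_removal("examination", "surgical examination")
print(log.filter_labels("examination", candidates))   # removed label is dropped
```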

Furthermore, the adaptation of the method towards user-introduced modifications can
already apply to successive sections in the present document. The method adapts to
modifications introduced by the user in the beginning part of a document and
automatically performs corresponding modifications within successive text sections.
The adaptation therefore applies to the present document as well as to future documents to
which the inventive method is applied.

Brief description of the drawings 

In the following, preferred embodiments of the invention will be described in greater
detail by making reference to the drawings in which:



Figure 1 illustrates a flow chart of the segmentation method of the present invention,
Figure 2 illustrates a flow chart for text segmentation incorporating analysis of user-introduced modifications,
Figure 3 illustrates a flow chart of an implementation of the present invention into a speech recognition process,
Figure 4 shows a block diagram of the user interface of the present invention,
Figure 5 shows a block diagram of the segmentation system.



Detailed Description 

Figure 1 illustrates a flow chart of the text segmentation and topic assignment method. In
the first step 100 an unstructured text, generated e.g. by a speech to text transcription
system, is inputted. Based on the inputted text, in step 102 the method performs a
structuring and topic assignment of the text by segmenting the text into text sections and
assigning a topic to each text section. In order to perform the text segmentation and
topic assignment in step 102, language or statistical models extracted from
training data are provided to step 102 by step 104. Step 105 provides label count
statistics indicating the probability that a label is assigned to a topic. Based on the
training data, the label count statistics reflect how often a label is assigned to a topic.
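
Label count statistics of this kind may, for example, be gathered as relative frequencies per topic. The following minimal sketch assumes that the annotated training data are available as (topic, label) pairs; this data layout and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def label_count_statistics(training_sections):
    """Count how often each label heads a section of each topic.

    training_sections -- iterable of (topic, label) pairs (assumed layout)
    """
    counts = defaultdict(Counter)
    for topic, label in training_sections:
        counts[topic][label] += 1
    return counts


def label_probability(counts, topic, label):
    """Relative frequency of `label` among all sections of `topic`."""
    total = sum(counts[topic].values())
    return counts[topic][label] / total if total else 0.0


data = ([("examination", "physical examination")] * 3
        + [("examination", "exam"), ("medication", "medications")])
stats = label_count_statistics(data)
print(label_probability(stats, "examination", "physical examination"))
print(stats["examination"].most_common(1)[0][0])   # default label of the topic
```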



In step 106 a label is assigned to each text section as a section heading and inserted at
the appropriate position into the text by making reference to the count statistics
provided by step 105 and the segmented text provided by step 102. After the label
assignment has been performed by step 106, the segmented text and the inserted labels
as well as alternative labels are provided to a user in step 108. Furthermore, alternative
section boundaries are provided to the user in step 108. In the successive step 110 the
user decides whether the provided segmentation and label assignment of step 108 is
acceptable. Alternatively the user can select alternative headings provided by step 108
or alternative segmentations provided by alternative section boundaries.

If none of the provided alternatives satisfies the user's preferences, the user can also
enter a section boundary as well as a section heading. The user's decision
of step 110 is processed by the method in step 112. Processing of
the user's decision comprises replacing inserted section headings, re-labelling
successive section headings, restructuring successive sections or part of the document, or
restructuring and re-labelling the entire document. Furthermore, dynamic processing
of user-introduced modifications is also conceivable. Dynamic processing means that a
user-introduced modification automatically triggers further modifications that are
related to succeeding text sections, or modifications to be performed during a successive
application of the structuring method.

After the user decision has been processed in step 112 the resulting modifications are
performed in the following step 114.

Figure 2 is illustrative of a flow chart of the text segmentation and text assignment
method incorporating analysis of user-introduced modifications. In a first step 200
an unstructured text resulting from e.g. a speech to text transcription process is
provided to step 202. In step 202 a text segmentation into text sections is performed by
making use of language or statistical models provided by step 204. Furthermore, in step
202 a topic is assigned to each text section by making use of the statistical information
stored in the language model provided by step 204.

After the text has been segmented into text sections and after each text section has been
assigned to a topic in step 202, in the succeeding step 206 a label is assigned to each
text section as a section heading and inserted at the appropriate position in the text. The
assignment of a label performed in step 206 makes explicit use of the label count
statistics provided to step 206 by step 205. Based on the training data, the label
count statistics reflect how often a label is assigned to a topic.

After the text has been structured by means of segmenting the text into text sections,
assigning a topic to each text section and further assigning a label to each text section,
the segmented text, the assigned headings as well as alternatives are provided to a user
in step 208. The alternatives provided to the user refer to alternative text segmentations
as well as alternative section labels. In the succeeding step 210 the user decides whether
to accept the performed segmentation of the text and the performed assignment of
section labels or to select one of the provided alternatives. Furthermore, the user can also
enter an arbitrary segmentation as well as an arbitrary section heading according to his
or her preference. After the user decision of step 210, in the following step 214 the
method checks whether any modifications have been introduced by the user. When in
step 214 no user-introduced modification has been detected, the method ends in step 218,
resulting in a structured and labelled text as produced in step 206. In contrast, when in
step 214 a user-introduced modification has been detected, the method proceeds with
step 212 in which the user-introduced modifications are processed and performed. The
processing and performing of a user's decisions incorporates a multiplicity of different
text segmentation, text labelling as well as text formatting procedures.

After the user decision has been processed and performed in step 212 the method
proceeds with step 216. In step 216 the user-introduced modifications are stored as
external conditions for a next application of the structuring and assigning procedure.
Depending on whether the user modification refers to the text structuring or to the
label assignment of text sections, after step 216 the method returns either to step 202 or
to step 206, in which a new structuring or a new label assignment is performed.

In a similar way a new restructuring and reassignment of the text performed by steps 202
and 206 explicitly accounts for already performed modifications provided by step 216.
In this way it can be guaranteed that user-performed modifications are never overruled
by the text structuring step 202 and the label assignment step 206.

Figure 3 is illustrative of an implementation of the text segmentation and topic
assignment method into a speech recognition system. In step 300 speech is inputted into
the system. In the following step 302 a first portion of the speech, p=1, is selected. The
first portion of speech selected by step 302 is provided to step 304, which performs a speech
to text transcription by making use of a language model m. The language model m is
provided by step 306 to step 304. After the speech portion p has been transcribed into a
text portion t by step 304, the resulting text portion t corresponding to the speech
portion p is stored in step 308. In the succeeding step 310 the speech portion index p is
compared to pmax, indicating the last portion of the speech. If p is less than pmax, p is
incremented by 1 and the method returns to step 304. The steps 304, 308 and 310 are
repeatedly applied until the speech portion index p equals the last speech portion pmax.
In this case the entire speech signal has been transcribed into text. The resulting text
then comprises a plurality of text portions t corresponding to the portions of the speech
p.

Based on the transcribed text, in step 312, a segmentation of the text into text sections is
performed and each of the text sections is assigned to a topic being specific to the
content of each section. This segmentation procedure of step 312 makes use of
statistical models designed for text segmentation that are provided to step 312 by step
314. When the text has been segmented and assigned to topics in step 312, in the
succeeding step 316, the topic assigned to each text section as well as the corresponding
speech portions p' of the text sections are determined. Based on this determination, a
second speech recognition of the speech portions p' referring to a distinct section can be
performed in the following step 318. Depending on the topic assigned to a text
section, a topic specific language model for the second speech recognition is provided
by step 306. Since the speech has been transcribed stepwise in the procedure described
by the steps 300 through 310, a repeated speech recognition can selectively be
performed for distinct sections of text that correspond to speech portions p'.
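
The two recognition passes of Figure 3 may be outlined as follows. The sketch is an assumption-laden illustration: `transcribe` and `segment` are hypothetical callables standing in for the speech recognizer and the segmentation of step 312, and the language models are represented by opaque placeholder values.

```python
def two_pass_recognition(portions, transcribe, general_model,
                         segment, topic_models):
    """Transcribe portion by portion, then redo sections per topic.

    portions      -- list of speech portions p = 1 .. pmax
    transcribe    -- hypothetical callable (portion, model) -> text portion
    general_model -- language model m used in the first pass
    segment       -- hypothetical callable (texts) -> list of
                     (topic, start, end) over portion indices, end exclusive
    topic_models  -- dict: topic -> topic specific language model
    """
    # Steps 302 to 310: first pass, one text portion t per speech portion p.
    texts = [transcribe(p, general_model) for p in portions]

    # Steps 312 to 318: second pass restricted to the portions p' of a section.
    for topic, start, end in segment(texts):
        model = topic_models.get(topic)
        if model is None:
            continue                      # no topic specific model available
        for i in range(start, end):
            texts[i] = transcribe(portions[i], model)
    return texts


# Stand-in components for demonstration only.
fake_transcribe = lambda portion, model: f"{model}:{portion}"
fake_segment = lambda texts: [("history", 0, 2), ("medication", 2, 3)]
print(two_pass_recognition(["a", "b", "c"], fake_transcribe, "general",
                           fake_segment, {"medication": "med-lm"}))
```

Because the text portions stay aligned with the speech portions by index, only the portions belonging to a given section need to be recognized again.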

When the repeated speech recognition step has been performed for each section of the 
text, a user can introduce further modifications referring to the segmentation of the text 
in step 320. According to the user introduced modifications of step 320, the method 
returns to the text segmenting step 312. Here, depending on the user's feedback, a new 
segmentation may take place and/or sections may be re-assigned to topics and labels.

When the performed text segmentations of step 312 and the repeated speech recognition 
of step 318 are both accepted by the user, the method ends with step 322. 

The assignment between a topic and a section performed in step 316, as well as the
speech transcription performed by step 304, can also make explicit use of a method of
text segmentation and topic annotation as described in the patent application "Text
segmentation and topic annotation for document structuring" and by the patent
application "Topic specific models for text formatting and speech recognition" filed by
the same applicant herewith concurrently.

In this way the expertise of a human proofreader can be universally and effectively 
coupled into a text segmentation and text labelling as well as into a corresponding 
speech recognition procedure. 

Figure 4 shows a block diagram of a user interface of the present invention. The user 
interface 400 is preferably adapted as a graphical user interface. The user interface 400 
comprises a text window 402 and a suggestion window 404. The text that has been 
subject to text segmentation and label assignment is provided within the text window 
402. A label 406 that has been inserted as a section heading into the text is highlighted
for better retrieval within the text provided in the text window 402. When for example
the user makes use of a pointer 408, the user can select the label 406 and in response to
the selection of the label 406 a label list 410 is provided within the user interface. The
label list 410 provides a whole set of labels 412, 414, 416 that serve as alternative labels
that can be inserted instead of label 406 into the text.



Additionally or alternatively the label list 410 can also be provided within the
suggestion window 404. By means of the pointer 408 the user can select one of the
labels 412, 414, 416 provided by the label list 410 to replace the label 406 in the given
text. When none of the labels 406, 412, 414, 416 matches the user's preferences, the
user can enter a label by making use of the user input field 418. Once an alternative
label has been selected or entered by the user, the label 406 is replaced by the alternative
label. In a similar way, alternative text segmentations
in the form of alternative section boundaries are provided to the user and can be
applied upon the user's selection.

Figure 5 shows a block diagram of a segmentation system of the present invention. The 
segmentation system 500 comprises a graphical user interface 520, a structured text 
module 518 for storing structured text, a processing unit 516, a statistical model module 
514 storing statistical models, an unstructured text module 512 storing unstructured text
and a speech recognition module 510 performing speech to text transcription. The 
segmentation system 500 is connected to an external storage device 508 and to an input 
device 504. A user 506 can interact with the segmentation system via the input device 
504 and the graphical user interface 520 of the segmentation system 500. 

Speech 502 that is input into the segmentation system is processed by the speech
recognition module 510. The speech recognition module 510 is connected to the
unstructured text module 512, where the unstructured text resulting from the speech to
text transcription process is stored. The unstructured text module 512 is connected to
the processing unit 516 in order to provide the unstructured text to the processing unit
516. The processing unit 516 is bidirectionally connected to the statistical model
module 514. By making use of the statistical information provided by the statistical
models stored in the statistical model module 514, the processing unit 516 performs a
text segmentation and label assignment to sections of the text on the basis of the
unstructured text provided by the unstructured text module 512. The speech recognition
module 510 also makes use of the language models stored and provided by the statistical
model module 514. In this way the statistical model module provides language models for
the text segmentation as well as language models for the speech recognition. The latter
are typically of a different type than the language models for text segmentation
because speech recognition usually makes use of trigrams whereas text segmentation
usually employs unigrams.
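
By way of illustration only, the use of topic-specific unigram language models for
text segmentation may be sketched as follows. The sketch assumes add-alpha smoothing, a
fixed penalty for each topic change and a Viterbi search over sentences; the function
names, the penalty value and the toy training data are hypothetical and are not taken
from the described embodiment.

```python
import math
from collections import Counter
from itertools import groupby


def train_unigram(sentences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram log-probabilities for one topic."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def segment(sentences, models, switch_penalty=1.0):
    """Viterbi search: one topic per sentence, with a cost for each topic change."""
    if not sentences:
        return []
    topics = list(models)
    # Words outside the vocabulary get the lowest score the topic model knows.
    floor = {t: min(models[t].values()) for t in topics}

    def score(topic, sentence):
        return sum(models[topic].get(w, floor[topic]) for w in sentence.lower().split())

    def cost(prev, cur):
        return 0.0 if prev == cur else switch_penalty

    best = {t: score(t, sentences[0]) for t in topics}
    back = []
    for s in sentences[1:]:
        step, pointers = {}, {}
        for t in topics:
            prev = max(topics, key=lambda p: best[p] - cost(p, t))
            step[t] = best[prev] - cost(prev, t) + score(t, s)
            pointers[t] = prev
        best = step
        back.append(pointers)

    # Trace the best topic sequence back from the last sentence.
    topic = max(topics, key=best.get)
    path = [topic]
    for pointers in reversed(back):
        topic = pointers[topic]
        path.append(topic)
    return path[::-1]


def to_sections(sentences, path):
    """Group consecutive sentences that share a topic into (topic, sentences) pairs."""
    pairs = zip(path, sentences)
    return [(topic, [s for _, s in group])
            for topic, group in groupby(pairs, key=lambda p: p[0])]


# Toy training data, invented for illustration only.
training = {
    "History": ["patient reports chest pain since monday",
                "patient has a history of asthma"],
    "Medication": ["prescribed aspirin daily", "continue inhaler twice daily"],
}
vocab = {w for sents in training.values() for s in sents for w in s.split()}
models = {topic: train_unigram(sents, vocab) for topic, sents in training.items()}

text = ["patient reports pain", "history of asthma",
        "prescribed aspirin", "inhaler twice daily"]
sections = to_sections(text, segment(text, models))
```

In this sketch a larger switch_penalty yields fewer and longer sections, so the penalty
plays a role comparable to a granularity setting.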

When the processing unit 516 has performed a text segmentation and an assignment of
labels to text sections as section headings, the structured text thus generated is stored in
the structured text module 518. The structured text module is connected to the graphical
user interface 520 in order to provide the structured text stored in the structured text
module 518 to the user 506 by means of the graphical user interface 520. The user 506
can interact with the segmentation system via the input device 504. To this end, the input
device 504 is connected to the graphical user interface 520 and to the processing unit
516. When the user 506 introduces modifications of either the text structuring or the
label assignment, the processing unit 516 performs a restructuring and a reassignment of
the structured text stored in the structured text module 518. The restructured and
reassigned structured text is repeatedly provided to the user until the performed
modifications match the user's preferences. When no further changes are introduced by
the user, the structured text stored in the structured text module 518 is transmitted to the
external storage device 508.
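
The review cycle described above may be sketched, purely as an example, in the
following way. The sketch assumes that the structured text is held as a list of
sentences plus a mapping from sentence index to section heading, and that a user
modification is either a relabelling or the moving of a heading to a new position. All
names are hypothetical, and the statistical re-segmentation performed by the processing
unit after each modification is deliberately omitted.

```python
def apply_modification(headings, mod):
    """Return a new {sentence_index: label} mapping with one modification applied."""
    headings = dict(headings)  # leave the caller's mapping untouched
    kind = mod[0]
    if kind == "relabel":
        # Select another label from the offered set, or enter a free-text label.
        _, position, label = mod
        if position not in headings:
            raise KeyError(f"no heading at sentence {position}")
        headings[position] = label
    elif kind == "move":
        # Moving a heading to another sentence redefines the section boundary.
        _, old_position, new_position = mod
        headings[new_position] = headings.pop(old_position)
    else:
        raise ValueError(f"unknown modification: {kind}")
    return headings


def render(sentences, headings):
    """Produce the structured text: each heading is placed before its first sentence."""
    lines = []
    for index, sentence in enumerate(sentences):
        if index in headings:
            lines.append(headings[index])
        lines.append(sentence)
    return "\n".join(lines)


def review(headings, modifications):
    """Apply the user's modifications in order; return final headings and a log."""
    log = []
    for mod in modifications:
        headings = apply_modification(headings, mod)
        log.append(mod)
    return headings, log


# Hypothetical session: the user renames one heading, then moves it up by one sentence.
sentences = ["patient reports pain", "prescribed aspirin", "inhaler twice daily"]
initial = {0: "History", 2: "Medication"}
final, log = review(initial, [("relabel", 2, "Therapy"), ("move", 2, 1)])
structured_text = render(sentences, final)
```

Once the list of modifications is exhausted, the rendered text corresponds to what would
be transmitted to the external storage device.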

Furthermore, structured text stored in the structured text module 518 can also be
exploited for improved speech recognition that is performed by means of the speech
recognition module 510. To this end, the structured text module 518 is directly connected
to the speech recognition module 510. Making use of this context-specific feedback
allows a more precise and specific speech recognition procedure to be performed by the
speech recognition module 510.

The invention therefore provides a method of document structuring and assigning of
labels to text sections serving as section headings. Especially in the field of automatic
speech recognition and automatic speech transcription, the proofreading task to be
performed by a human proofreader is greatly facilitated. Given the proposed
segmentation of the text, it is much easier for the proofreader to check whether the text
following some heading really represents a section of the corresponding type, as opposed
to conventional proofreading procedures where a portion of text has to be examined, a
section has to be determined and a heading has to be inserted into the text by jumping
back to the beginning of a section.

Furthermore, the method supplies alternative section boundaries as well as alternative
section labels that can easily be selected by the proofreader. Moreover, during a
proofreading process the system learns the most frequent corrections introduced by the
proofreader and makes use of this information for future applications.
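
One conceivable way of learning the most frequent corrections is sketched below. It
assumes that every label chosen by the proofreader is counted per topic and that the
alternative labels offered for a topic are subsequently ordered by these counts. The
class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class LabelStatistics:
    """Per-topic counts of the labels a proofreader actually chose."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def log_correction(self, topic, chosen_label):
        """Record that the proofreader chose chosen_label for a section of this topic."""
        self.counts[topic][chosen_label] += 1

    def ranked_labels(self, topic, candidates):
        """Order candidate labels by how often they were chosen, most frequent first.

        The sort is stable, so labels with equal counts keep their original order.
        """
        chosen = self.counts[topic]
        return sorted(candidates, key=lambda label: -chosen[label])


# Hypothetical log of three corrections for sections of the topic "Medication".
stats = LabelStatistics()
for label in ["Therapy", "Drugs", "Therapy"]:
    stats.log_correction("Medication", label)
ranking = stats.ranked_labels("Medication", ["Medication", "Drugs", "Therapy"])
```

Offering the most frequently chosen label first is only one possible use of the logged
corrections; the counts could equally serve as additional training data for the
statistical models.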
CLAIMS



1. A method of segmentation of a text (512) into text sections and assigning a topic to
each text section on the basis of annotated training data, the method comprising the
steps of:

- segmenting the text (512) into text sections by making use of statistical models
(514) extracted from training data,

- assigning a topic being indicative of the content of the text section to each text
section by making use of the statistical models extracted from the training
data,

- generating a structured text by inserting a label as a section heading into the text
in order to assign the label to the text section,

- providing the structured text to a user (506),

- processing of modifications of the structured text in response to a user's review.

2. The method according to claim 1, wherein the topic assigned to a text section is 
further assigned to a set of labels (410), one of which being assigned to the text 
section and inserted as section heading into the text. 

3. The method according to claim 1 or 2, wherein providing the structured text to a 
user further comprises for each text section providing the set of labels (410) 
assigned to the topic that is assigned to the text section. 

4. The method according to any one of the claims 1 to 3, wherein the text 
modification comprises a modification of the segmentation of the text into 
sections and/or a modification of the assignment between a label and a text 
section. 
5. The method according to claim 3 or 4, wherein the modification of the
structured text comprises:

- assigning a label to a text section by selecting one label (412, 414,...) of the set
of labels (410) assigned to the topic that is assigned to the text section,

- re-defining a section boundary by selecting an assigned label (406) at a first
position in the text and moving the assigned label to a second position
within the text, the second position defining the section boundary, and the
selected label defining the section heading,

- entering a label and assigning the entered label to the text section.

6. The method according to any one of the claims 1 to 5, wherein the processing of
modifications of the structured text (518) comprises performing modifications in
the text in response to the user's review and successively triggering the steps of:

- re-segmenting the text into text sections by making use of the statistical models
(514) extracted from the training data and by making reference to the
performed modifications,

- re-generating a structured text (518) by inserting a label as a section heading into
the text by making reference to the performed modifications, assigning the
label to the text section and providing the structured text to the user for
review.

7. The method according to any one of the claims 1 to 6, wherein the processing of
modifications of the structured text comprises replacing a text portion by a label 
within the text, when the replaced text portion is identified as a formulation 
describing a section heading. 

8. The method according to any one of the claims 1 to 7, wherein the granularity of
the text segmentation is controlled by the user by means of a customizable 
granularity parameter. 






PHDE030397 EPP 



9. The method according to any one of the claims 1 to 8, wherein the modifications
of the structured text are logged and analyzed in order to adapt the statistical 
models. 

10. A text segmentation system (500) for segmenting a text (512) into text sections
and assigning a topic to each text section on the basis of annotated training data,
the text segmentation system comprising:

- means for segmenting the text (512) into text sections by making use of
statistical models (514) extracted from the training data,

- means for assigning a topic being indicative of the content of the text section to
each text section by making use of the statistical models extracted from the
training data, the topic being further assigned to a set of labels (410),

- means for generating a structured text (518) by inserting one label of the set of
labels as a section heading into the text in order to assign the label to the text
section,

- means for providing (520) the structured text to a user (506),

- means for processing (516) of modifications of the structured text in response to
a user's review.

11. The text segmentation system according to claim 10, wherein means for
processing of modifications (516) of the structured text (518) are adapted to 
perform a modification of the segmentation of the text into sections and/or a 
modification of the assignment between a label and a text section. 

12. The text segmentation system according to claim 10 or 11, wherein means for
processing of modifications of the structured text are further adapted to perform:

- assigning a label to a text section by selecting one label (412, 414,...) of the set
of labels (410) assigned to the topic that is assigned to the text section,

- re-defining a section boundary by selecting an assigned label (406) at a first
position in the text and moving the assigned label to a second position
within the text, the second position defining the section boundary, and the
selected label defining the section heading,

- entering a label and assigning the entered label to the text section.

13. The text segmentation system according to any one of the claims 10 to 12,
wherein the means for processing of modifications (516) of the structured text
(518) are adapted to perform modifications in the text in response to the user's
review and further comprising means for successively triggering the steps of:

- re-segmenting the text into text sections by making use of the statistical models
(514) extracted from the training data and by making reference to the
performed modifications,

- re-generating a structured text by inserting a label as a section heading into the
text by making reference to the performed modifications, assigning the label
to the text section and providing the structured text to a user for review.

14. The text segmentation system according to any one of the claims 10 to 13,
further comprising means for logging and analyzing the performed modifications 
of the structured text, the means for logging and analyzing being adapted to adapt 
the statistical models (514). 

15. A computer program product for segmenting a text (512) into text sections and
assigning a topic to each text section on the basis of annotated training data, the
computer program product comprising program means for:

- segmenting the text into text sections by making use of statistical models (514)
extracted from the training data,

- assigning a topic being indicative of the content of the text section to each text
section by making use of the statistical models extracted from the training
data, the topic being further assigned to a set of labels (410),

- generating a structured text (518) by inserting one label (412, 414,...) of the set
of labels (410) as a section heading into the text in order to assign the label
to the text section,

- providing the structured text to a user (506),

- processing of modifications of the structured text (518) in response to a user's
review.

16. The computer program product according to claim 15, wherein the program
means for processing of modifications of the structured text are adapted to
perform a modification of the segmentation of the text into sections and/or a
modification of the assignment between a label and a text section, for the
modification of the assignment between a label and a text section the program
means are further adapted to perform the steps of:

- assigning a label to a text section by selecting one label of the set of labels
assigned to the topic that is assigned to the text section,

- re-defining a section boundary by selecting an assigned label at a first position in
the text and moving the assigned label to a second position within the text,
the second position defining the section boundary, and the selected label
defining the section heading,

- entering a label and assigning the entered label to the text section.

17. The computer program product according to claim 15 or 16, wherein the
program means for processing of modifications of the structured text are adapted
to perform modifications in the text in response to the user's review and further
comprising program means for successively triggering the steps of:

- re-segmenting the text into text sections by making use of the statistical models
extracted from the training data and by making reference to the performed
modifications,

- re-generating a structured text (518) by inserting a label as a section heading into
the text by making reference to the performed modifications, assigning the
label to the text section and providing the structured text to the user for
review.



18. A user interface (400) for a text segmentation system for segmenting a text into
text sections and assigning a topic to each text section on the basis of annotated
training data, the user interface comprising:

- means for providing the structured text to a user that has been structured by
making use of statistical models extracted from the training data,

- means for providing a set of labels (410) to the user, the set of labels being
assigned to each topic that is assigned to each text section,

- input means (408) for processing modifications of the structured text in response
to a user's review,

- means for logging and analyzing processed modifications of the structured text
in order to train the statistical models.

19. The user interface according to claim 18, wherein the structured text is provided
to a user by means of a graphical user interface (400; 502) and wherein the input 
means (408, 418; 504) are adapted to process modifications of the structured text 
in form of the user selecting one label (412, 414,...) of the provided set of labels 
(410), the selected label being assigned to the text section. 

20. The user interface according to claim 18 or 19, further comprising means for
providing text that has been re-segmented and re-labelled in response to the user's 
review by making use of the statistical models and by making reference to the 
processed modifications. 
ABSTRACT



Text segmentation and label assignment with user interaction by means of topic-specific 
language models 



The invention relates to a method, a computer program product, a segmentation system
and a user interface for structuring an unstructured text by making use of statistical
models trained on annotated training data. The method performs text segmentation into
text sections and assigns labels to text sections as section headings. The performed
segmentation and assignment is provided to a user for general review. Additionally,
alternative segmentations and label assignments are provided to the user, who can
select alternative segmentations and alternative labels as well as enter a user-defined
segmentation and a user-defined label. In response to the modifications
introduced by the user, a plurality of different actions are initiated, incorporating the
re-segmentation and re-labelling of successive parts of the document or the entire
document. Furthermore, the method comprises a learning functionality, logging and
analyzing user-introduced modifications for adaptation of the method to the user's
preferences and for further training of the statistical models.

(figure 2) 